1.  Introduction


In  many  practical  situations,  the  goal  of  the  experimenter  is  to  compare  two  or  more 
populations  in  order  to  make  a  decision  in  the  form  of  a  ranking  of  the  populations. 
The  best  studied  ranking  goal  concerns  the  best  population  (the  most  efficient  drug  for 
an  ailment,  the  most  effective  manufacturing  process  and  so  on).  The  classical  tests  of 
homogeneity were not designed to provide answers to such questions. Rejecting the null
hypothesis is not the final solution to the experimenter's problem but an exercise that
underlines the need for a reformulation of the problem. Born out of this need is the statistical
theory of ranking and selection procedures.

Ranking and selection problems have been generally formulated using either the indifference
zone approach due to Bechhofer (1954) or the subset selection approach of Gupta
(1956, 1965). Starting from the early developments in the 1950's, these problems have
been extensively studied under various model assumptions and modifications in the ranking
goals. A comprehensive survey of these developments is provided in Gupta and Panchapakesan
(1979), who have in a later paper (1985) given a review of these and subsequent
developments with historical perspectives.

In  the  present  paper,  we  review  some  recent  developments  in  the  ranking  and  selection 
theory.  We  will  focus  our  attention  on  the  following  topics:  (A)  Selecting  the  largest  normal 
mean  and  estimating  the  selected  mean,  (B)  Empirical  Bayes  selection,  (C)  Selecting  the 
important  regression  variables,  (D)  Sequential  selection  rules,  and  (E)  Lower  confidence 
bounds  for  the  probability  of  a  correct  selection. 


2.  Finding  the  Largest  Normal  Mean  and  Estimating  the  Selected  Mean 

Let $\pi_1, \ldots, \pi_k$ be $k\ (\ge 2)$ normal populations with unknown means $\theta_1, \ldots, \theta_k$ and a
common known variance $\sigma^2$. Let $\theta_{[1]} \le \cdots \le \theta_{[k]}$ denote the ordered $\theta_i$. The population
associated with $\theta_{[k]}$ is called the best population. The goal is to select one of the $k$
populations as the best. Since no procedure assures the selection of the best with certainty,
estimation of the mean of the selected population is of practical interest.

Let $\bar{X}_i$ denote the sample mean of $n_i$ independent observations from $\pi_i$, $i = 1, \ldots, k$.
The so-called natural selection rule selects as the best the population that yields the
largest $\bar{X}_i$. When the sample sizes $n_1, \ldots, n_k$ are all equal, this rule is the uniformly best
permutation invariant selection rule for a general class of loss functions. However, for
unequal sample sizes, the natural selection rule loses much of its optimality (see Gupta
and Miescke (1988)).

A natural estimator of $\theta_{(k)}$, the selected $\theta_i$, is $\bar{X}_{[k]}$, the largest $\bar{X}_i$. However, $\bar{X}_{[k]}$
overestimates $\theta_{[k]}$ and thus overestimates $\theta_{(k)}$ even more. Recognizing this, alternative
estimators have been studied for the present and other experimental models and goals by
the following authors: Sarkadi (1967), Dahiya (1974), Cohen and Sackrowitz (1982), Sackrowitz
and Samuel-Cahn (1984, 1986), Jeyaratnam and Panchapakesan (1984, 1986, 1988),
Vellaisamy and Sharma (1988, 1989), Vellaisamy, Kumar and Sharma (1988), and Venter
(1988). Since selection is made first, the preceding estimation problem is known as
estimation after selection.

Cohen and Sackrowitz (1988) presented a decision-theoretic framework for the combined
selection-estimation problem, and derived results for the case of $k = 2$ and $n_1 = n_2$.
Recently, Gupta and Miescke (1990) have extended the results of Cohen and Sackrowitz
(1988) and provided a detailed discussion of the normal distribution problem. Rather than
“estimating after selection,” the decision-theoretic treatment of the combined decision
problem leads to “selecting after estimation.”

We  will  first  discuss  the  decision-theoretic  approach  under  a  general  framework  and 
then  examine  the  normal  means  case. 


2.1  General  Framework 

Let $\underline{X} = (X_1, \ldots, X_k)$ be a random vector of observations having pdf
$f(\underline{x} \mid \underline{\theta}) = \prod_{i=1}^{k} f_i(x_i \mid \theta_i)$, where $\underline{x} = (x_1, \ldots, x_k)$ and $\underline{\theta} = (\theta_1, \ldots, \theta_k)$. Here, $\underline{X}$ may be a vector of
sufficient statistics for $\underline{\theta}$. The goal is to select the “population” associated with
$\theta_{[k]}$ and to simultaneously estimate $\theta_s$, the selected $\theta$-value. For this combined problem,
a nonrandomized decision rule is:

$$d(\underline{x}) = (s(\underline{x}), \ell_{s(\underline{x})}(\underline{x})) \tag{2.1}$$

where $s(\underline{x}) \in \{1, 2, \ldots, k\}$ is the selection rule, and $\ell_i(\underline{x})$ is an estimate for $\theta_i$, $i = 1, \ldots, k$.
We assume an additive loss function $L(\underline{\theta}, d)$ given by

$$L(\underline{\theta}, d) = A(\underline{\theta}, s) + B(\theta_s, \ell_s) \tag{2.2}$$

where $A$ is the loss incurred in selecting $\pi_s$ as the best population when $\underline{\theta}$ is the true
parameter vector, and $B$ is the loss of estimating $\theta_s$ by $\ell_s$.

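Adopting the Bayes approach, we assume that $\underline{\Theta} = (\Theta_1, \ldots, \Theta_k)$ has a prior distribution
$G$. Then, for $\underline{X} = \underline{x}$, the posterior risk of $d(\underline{x})$ can be expressed as:

$$r(d(\underline{x})) = r_A(s(\underline{x})) + r_B(s(\underline{x}), \ell_{s(\underline{x})}(\underline{x})), \tag{2.3}$$

where

$$r_A(s(\underline{x})) = E\{A(\underline{\Theta}, s(\underline{x})) \mid \underline{X} = \underline{x}\}, \quad \text{and}$$

$$r_B(s(\underline{x}), \ell_{s(\underline{x})}(\underline{x})) = E\{B(\Theta_{s(\underline{x})}, \ell_{s(\underline{x})}(\underline{x})) \mid \underline{X} = \underline{x}\}.$$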

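The following theorem of Gupta and Miescke (1990) is an extension of a result of
Cohen and Sackrowitz (1988).

Theorem 2.1. Let $\ell_i^*(\underline{x})$ minimize $r_B(i, \ell_i(\underline{x}))$, $i = 1, \ldots, k$, and let $s^*(\underline{x})$ minimize
$r_A(s(\underline{x})) + r_B(s(\underline{x}), \ell^*_{s(\underline{x})}(\underline{x}))$. Then the Bayes decision rule is:

$$d^*(\underline{x}) = (s^*(\underline{x}), \ell^*_{s^*(\underline{x})}(\underline{x})).$$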


Remark  2.1.  It  can  be  seen  that  the  combined  selection-estimation  problem  is  in  a  sense 
“selecting  after  estimation.” 

Corollary 2.1. Whenever at $\underline{X} = \underline{x}$, $r_B(i, \ell_i^*(\underline{x}))$ does not depend on $i \in \{1, 2, \ldots, k\}$, $s^*(\underline{x})$
minimizes $r_A(s(\underline{x}))$.

Let $s^N(\underline{x})$ denote the natural selection rule which selects the population corresponding
to the largest $x_i$. The following example shows that $s^N$ is not the same as $s^*$ in general.
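Example 2.1. Let $X_i \sim N(\theta_i, 1)$, $i = 1, \ldots, k$, be independent. Assume that $\Theta_1, \ldots, \Theta_k$
are iid having the density $\exp(-\theta)$, $\theta > 0$, and consider a loss function given by $L(\underline{\theta}, d) =
A(\underline{\theta}, s) + (\theta_s - \ell_s)^2$, where $A$ is permutation invariant and favors selecting large $\theta$-values.
A posteriori, at $\underline{X} = \underline{x}$, $\Theta_1, \ldots, \Theta_k$ are independent and the posterior density of $\Theta_i$ is
$\varphi(\theta_i - y_i)/\Phi(y_i)$, $\theta_i > 0$, where $\varphi$ and $\Phi$ denote the $N(0, 1)$ density and cdf, respectively,
and $y_i = x_i - 1$, $i = 1, \ldots, k$. Straightforward computations yield, for $i = 1, \ldots, k$,

$$\ell_i^*(\underline{x}) = E[\Theta_i \mid \underline{X} = \underline{x}] = y_i + \frac{\varphi(y_i)}{\Phi(y_i)},$$

$$r_B(i, \ell_i^*(\underline{x})) = \mathrm{Var}[\Theta_i \mid \underline{X} = \underline{x}] = 1 - y_i \frac{\varphi(y_i)}{\Phi(y_i)} - \left[\frac{\varphi(y_i)}{\Phi(y_i)}\right]^2.$$

Thus, although $s^N(\underline{x})$ minimizes $r_A(s(\underline{x}))$, $s^N$ is different from $s^*$, since $r_B(i, \ell_i^*(\underline{x}))$
depends on $i \in \{1, \ldots, k\}$.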
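As a numerical illustration of Example 2.1, the following sketch evaluates the posterior mean and variance of a normal distribution with mean $y_i = x_i - 1$ and unit variance, truncated to $(0, \infty)$. The function names and the sample values of $x$ are hypothetical choices made for illustration only; they do not come from the paper.

```python
import math

def phi(y):
    """Standard normal density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def Phi(y):
    """Standard normal cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def posterior_mean_var(x):
    """Posterior mean and variance of Theta_i given X_i = x in Example 2.1."""
    y = x - 1.0                      # y_i = x_i - 1
    ratio = phi(y) / Phi(y)
    mean = y + ratio                 # Bayes estimate under squared error loss
    var = 1.0 - y * ratio - ratio ** 2   # posterior risk r_B(i, l_i*(x))
    return mean, var

xs = [0.5, 1.0, 3.0]                 # illustrative observations only
results = [posterior_mean_var(x) for x in xs]
for x, (m, v) in zip(xs, results):
    print(f"x = {x:4.1f}  posterior mean = {m:.4f}  posterior variance = {v:.4f}")
```

The printed variances differ across the three observations, which is the dependence on $i$ that separates the Bayes selection rule from the natural rule in this example.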


2.2.  Normal  Means  Problem 


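Let $X_i$ denote the mean of a random sample of size $n_i$ from a $N(\theta_i, \sigma^2)$ population,
$i = 1, \ldots, k$, where the common variance $\sigma^2$ is assumed to be known. A priori, $\Theta_1, \ldots, \Theta_k$
are independent and $\Theta_i \sim N(\mu_i, q_i)$, $i = 1, \ldots, k$. Thus, given $\underline{X} = \underline{x}$, $\Theta_1, \ldots, \Theta_k$ are
a posteriori independent with

$$\Theta_i \mid \underline{X} = \underline{x} \sim N\left(\frac{q_i x_i + p_i \mu_i}{q_i + p_i}, \frac{p_i q_i}{p_i + q_i}\right), \quad i = 1, \ldots, k.$$

Also, the $X_i$'s are marginally independent with $X_i \sim N(\mu_i, p_i + q_i)$, where $p_i = \sigma^2/n_i$, $i = 1, \ldots, k$.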


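Equal Sample Sizes. We let $n_1 = \cdots = n_k = n$. We also assume that $\mu_1 = \cdots = \mu_k = \mu$
and $q_1 = \cdots = q_k = q$, i.e. we have exchangeable normal priors. We assume the loss
function $L(\underline{\theta}, d)$ in (2.2) with two possible forms $B_1(\theta_s, \ell_s)$ and $B_2(\theta_s, \ell_s)$ for $B(\theta_s, \ell_s)$,
given by

$$B_1(\theta_s, \ell_s) = |\theta_s - \ell_s|, \qquad B_2(\theta_s, \ell_s) = (\theta_s - \ell_s)^2. \tag{2.4}$$

Also, $A(\underline{\theta}, s)$ in (2.2) is assumed to be permutation symmetric and favorable to selection
of larger $\theta$-values.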


Under the above assumptions, Gupta and Miescke (1990) have shown that the Bayes
rule $d^* = (s^*, \ell^*)$ satisfies $s^* = s^N$ and $\ell_i^*(\underline{x}) = E\{\Theta_i \mid \underline{X} = \underline{x}\}$, $i = 1, \ldots, k$.

Consider the natural decision rule $d^N = (s^N, \ell^N)$, where $\ell_i^N(\underline{x}) = x_i$, $i = 1, \ldots, k$.
Although, from the frequentist point of view, $d^N$ has the undesirable feature of overestimating
the mean of the selected population, it has been shown by Gupta and Miescke
(1990) to be an extended Bayes rule.


Unequal Sample Sizes Case. Here we will consider two particular loss functions,
namely,

$$\begin{cases} L_1(\underline{\theta}, d) = c(\theta_{[k]} - \theta_s) + |\theta_s - \ell_s| \\ L_2(\underline{\theta}, d) = c(\theta_{[k]} - \theta_s)^2 + (\theta_s - \ell_s)^2 \end{cases} \tag{2.5}$$

where $c > 0$ gives relative weights to the two parts of the loss function. Since the sample
sizes are unequal, it is appropriate to consider non-exchangeable priors, $\Theta_i \sim N(\mu_i, q_i)$, $i =
1, \ldots, k$. The $\Theta_i$'s are, of course, independent.


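Under the above setup, with loss function $L_1$, the Bayes rule (by Theorem 2.1) employs
the estimator $\ell_i^*(\underline{x}) = (q_i x_i + p_i \mu_i)/(q_i + p_i)$ for $\theta_i$, $i = 1, \ldots, k$, and one has to find $s^*(\underline{x})$.
For any decision rule $d = (s, \ell^*)$, the posterior risk at $\underline{X} = \underline{x}$ associated with selection
$s(\underline{x}) = i \in \{1, \ldots, k\}$ turns out to be

$$c\left[E\{\Theta_{[k]} \mid \underline{X} = \underline{x}\} - \frac{q_i x_i + p_i \mu_i}{q_i + p_i}\right] + \left(\frac{2 p_i q_i}{\pi(q_i + p_i)}\right)^{1/2}.$$

This leads to the Bayes rule $d^* = (s^*, \ell^*)$, where $\ell_i^*(\underline{x}) = (q_i x_i + p_i \mu_i)/(q_i + p_i)$, $i = 1, \ldots, k$,
and $s^*(\underline{x})$ maximizes $c\,\ell_i^*(\underline{x}) - [2 q_i p_i/(\pi(q_i + p_i))]^{1/2}$, $i = 1, \ldots, k$.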
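The Bayes rule under $L_1$ can be sketched in a few lines. This is a minimal illustration under the normal-normal model of this subsection; the function name, the zero-based indexing of populations, and the numerical inputs are assumptions made for the example and are not taken from Gupta and Miescke (1990).

```python
import math

def bayes_rule_l1(x, n, mu, q, sigma2, c):
    """Return (selected index, estimates, scores) for the Bayes rule under L1.

    x: sample means, n: sample sizes, mu/q: prior means/variances,
    sigma2: known common variance, c: weight on the selection part of the loss.
    Populations are indexed from 0 here.
    """
    p = [sigma2 / ni for ni in n]                      # p_i = sigma^2 / n_i
    est = [(qi * xi + pi_ * mi) / (qi + pi_)           # posterior means l_i*(x)
           for xi, mi, qi, pi_ in zip(x, mu, q, p)]
    score = [c * e - math.sqrt(2.0 * qi * pi_ / (math.pi * (qi + pi_)))
             for e, qi, pi_ in zip(est, q, p)]
    s_star = max(range(len(x)), key=score.__getitem__)
    return s_star, est, score

# Illustrative data: the second population has the larger sample mean
# but a much smaller sample size.
x, n = [1.0, 1.05], [50, 2]
mu, q = [0.0, 0.0], [100.0, 100.0]
s_small_c, est, _ = bayes_rule_l1(x, n, mu, q, sigma2=1.0, c=1.0)
s_large_c, _, _ = bayes_rule_l1(x, n, mu, q, sigma2=1.0, c=100.0)
natural = max(range(len(x)), key=x.__getitem__)
print(s_small_c, s_large_c, natural)
```

With these inputs the rule departs from the natural rule when $c$ is small, because the estimation part of the loss penalizes the poorly sampled population, and agrees with it when $c$ is large.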


It  is  interesting  to  note  three  special  cases,  which  are  as  follows. 


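Case 1: Noninformative prior ($q_i \to \infty$, $i = 1, \ldots, k$). In this case, $\ell_i^*(\underline{x}) = x_i$, $i =
1, \ldots, k$, and $s^*(\underline{x})$ maximizes $x_i - c^{-1}(2p_i/\pi)^{1/2}$, $i = 1, \ldots, k$.

Case 2: Prior variances proportional to sample variances ($q_i = \gamma p_i$, $i = 1, \ldots, k$, for
some $\gamma > 0$). In this case, $\ell_i^*(\underline{x}) = (\gamma x_i + \mu_i)/(\gamma + 1)$, $i = 1, \ldots, k$, and $s^*(\underline{x})$ maximizes
$\ell_i^*(\underline{x}) - c^{-1}(2\gamma p_i/((\gamma + 1)\pi))^{1/2}$, $i = 1, \ldots, k$. In particular, for $\mu_1 = \cdots = \mu_k = \mu$ (say),
$\ell_i^*(\underline{x}) = (\gamma x_i + \mu)/(\gamma + 1)$, $i = 1, \ldots, k$, and $s^*(\underline{x})$ maximizes $x_i - c^{-1}\{2(\gamma + 1)p_i/(\gamma\pi)\}^{1/2}$, $i =
1, \ldots, k$.

Case 3: Posterior decreasing in transposition (DT), i.e. $q_i^{-1} + p_i^{-1} = r^{-1}$, $i = 1, \ldots, k$,
for some fixed $r > 0$. In this case, $\ell_i^*(\underline{x}) = r(p_i^{-1} x_i + q_i^{-1} \mu_i)$, $i = 1, \ldots, k$, and $s^*(\underline{x})$
maximizes $\ell_i^*(\underline{x})$, $i = 1, \ldots, k$. In particular, when $\mu_1 = \cdots = \mu_k = \mu$ (say), $\ell_i^*(\underline{x}) =
p_i^{-1} r (x_i - \mu) + \mu$, $i = 1, \ldots, k$, and $s^*(\underline{x})$ maximizes $p_i^{-1}(x_i - \mu)$, $i = 1, \ldots, k$.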
It has been shown by Gupta and Miescke (1990) that the decision rule of Case 1
(noninformative prior) is an extended Bayes rule.

In the case of loss function $L_2$, the analysis gets more complicated. For finding the
Bayes rule, we have the same $\ell_i^*(\underline{x})$ as in the case of $L_1$, but for finding $s^*(\underline{x})$ one has to
minimize

$$c\,E\{[\Theta_{[k]} - \Theta_i]^2 \mid \underline{X} = \underline{x}\} + \frac{p_i q_i}{q_i + p_i}. \tag{2.6}$$

The difficulty lies in the fact that, for any $i$, the conditional distribution of $(\Theta_{[k]}, \Theta_i)$ at
$\underline{X} = \underline{x}$ does not yield simpler representations for the conditional expectation in (2.6),
which in most situations has to be evaluated on a computer.
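One way to evaluate the conditional expectation in (2.6) on a computer is plain Monte Carlo over the independent normal posteriors. The sketch below is only one possible approach, not the method used by Gupta and Miescke (1990); the function names, the number of draws, the seed, and the inputs are all hypothetical.

```python
import math
import random

def posterior_moments(x, n, mu, q, sigma2):
    """Posterior means and variances of Theta_i given X = x."""
    means, variances = [], []
    for xi, ni, mi, qi in zip(x, n, mu, q):
        p_i = sigma2 / ni
        means.append((qi * xi + p_i * mi) / (qi + p_i))
        variances.append(p_i * qi / (p_i + qi))
    return means, variances

def l2_posterior_risks(x, n, mu, q, sigma2, c, draws=5000, seed=0):
    """Monte Carlo estimate of expression (2.6) for each candidate i."""
    rng = random.Random(seed)
    means, variances = posterior_moments(x, n, mu, q, sigma2)
    sds = [math.sqrt(v) for v in variances]
    sums = [0.0] * len(x)
    for _ in range(draws):
        theta = [rng.gauss(m, s) for m, s in zip(means, sds)]
        top = max(theta)                      # a draw of Theta_[k]
        for i, t in enumerate(theta):
            sums[i] += (top - t) ** 2
    return [c * s / draws + v for s, v in zip(sums, variances)]

risks = l2_posterior_risks(x=[0.0, 5.0], n=[10, 10], mu=[0.0, 0.0],
                           q=[10.0, 10.0], sigma2=1.0, c=1.0)
s_star = min(range(len(risks)), key=risks.__getitem__)   # zero-based index
print(risks, s_star)
```

The selected index is the one with the smallest estimated value of (2.6); with well-separated sample means, as here, it coincides with the natural choice.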

As in the case of $L_1$, we can specialize the problem in the three special cases regarding
the assumptions about the prior. In Case 3, the Bayes rule is the same as in the case of
$L_1$.


3.  Empirical  Bayes  Selection  Procedures 

The  empirical  Bayes  approach  in  statistical  decision  theory  is  typically  appropriate 
when  one  is  confronted  repeatedly  and  independently  with  the  same  decision  problem.  In 
such  instances,  it  is  reasonable  to  formulate  the  component  problem  with  respect  to  an 
unknown  prior  distribution  on  the  parameter  space.  One  then  uses  information  borrowed 
from  other  sources  to  improve  the  decision  procedure  for  each  component.  This  approach 
is  due  to  Robbins  (1956,1964).  Empirical  Bayes  procedures  have  been  derived  for  multiple 
decision  problems  by  Deely  (1965).  Recently,  Gupta  and  Hsiao  (1983),  Gupta  and  Leu 
(1983),  and  Gupta  and  Liang  (1986, 1988a, b,  1989a, b,c)  have  investigated  empirical  Bayes 
procedures  for  several  selection  problems.  Many  such  empirical  Bayes  procedures  have 
been  shown  to  be  asymptotically  optimal  in  the  sense  that  the  component  Bayes  risk  will 
converge  to  the  optimal  Bayes  risk  which  would  have  been  obtained  if  the  prior  distribution 
were  fully  known,  and  the  Bayes  procedure  with  respect  to  this  prior  distribution  was  used. 

In this section, we will describe empirical Bayes selection procedures with respect to
a standard. Two kinds of empirical Bayes procedures will be considered. One incorporates
past data to improve the current decision. The other shares information
across the component problems so as to simultaneously improve the decisions for each of the component
problems under study. A Poisson distribution model is used as an example to describe the
empirical Bayes idea and methods.

3.1.  Formulation  of  the  Empirical  Bayes  Selection  Problem 

Let $\pi_1, \ldots, \pi_k$ denote $k$ independent populations. For each $i = 1, \ldots, k$, let $X_i$ denote
a random observation arising from population $\pi_i$. It is assumed that $X_i$ follows a Poisson
distribution with probability function $f_i(x \mid \theta_i)$, where

$$f_i(x \mid \theta_i) = e^{-\theta_i}\theta_i^x/x!, \quad x = 0, 1, 2, \ldots;\ \theta_i > 0.$$

Let $\theta_0 > 0$ be a known standard. Population $\pi_i$ is said to be good if $\theta_i \ge \theta_0$, and
bad otherwise. The goal is to select all the good populations and exclude all the bad
populations.

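Let $\Omega = \{\underline{\theta} = (\theta_1, \ldots, \theta_k) \mid \theta_i > 0,\ i = 1, \ldots, k\}$ be the parameter space and let
$A = \{\underline{a} = (a_1, \ldots, a_k) \mid a_i = 0, 1,\ i = 1, \ldots, k\}$ be the action space. When action $\underline{a}$ is taken,
it means that population $\pi_i$ is selected as a good population if $a_i = 1$, and excluded as a
bad one if $a_i = 0$. For each $\underline{\theta} \in \Omega$ and $\underline{a} \in A$, the loss function $L(\underline{\theta}, \underline{a})$ is defined to be:

$$L(\underline{\theta}, \underline{a}) = \sum_{i=1}^{k} a_i(\theta_0 - \theta_i)I(\theta_0 - \theta_i) + \sum_{i=1}^{k} (1 - a_i)(\theta_i - \theta_0)I(\theta_i - \theta_0) \tag{3.1}$$

where $I(x) = 1$ if $x \ge 0$ and $I(x) = 0$ if $x < 0$.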

It is assumed that for each $i$, the parameter $\theta_i$ is a realization of a random variable
$\Theta_i$ which has a prior distribution $G_i$. It is also assumed that $\Theta_1, \ldots, \Theta_k$ are mutually
independent. Thus $\underline{\Theta} = (\Theta_1, \ldots, \Theta_k)$ has a joint prior distribution $G(\underline{\theta}) = \prod_{i=1}^{k} G_i(\theta_i)$.

For each $i = 1, \ldots, k$, let $\mathcal{X}_i$ be the sample space of $X_i$, and let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_k$. Let
$\underline{X} = (X_1, \ldots, X_k)$. A selection rule $d = (d_1, \ldots, d_k)$ is defined to be a mapping from $\mathcal{X}$
into $[0, 1]^k$ such that $d_i(\underline{x})$ is the probability of selecting population $\pi_i$ as a good population
when $\underline{X} = \underline{x}$ is observed. Let $D$ be the class of all selection rules, and let $r(G, d)$ denote
the Bayes risk associated with each $d \in D$. Then, $r(G) = \inf_{d \in D} r(G, d)$ is the minimum
Bayes risk.


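The Bayes risk associated with any rule $d \in D$ can be written as:

$$r(G, d) = \sum_{i=1}^{k} r_i(G, d_i) \tag{3.2}$$

where

$$r_i(G, d_i) = \sum_{\underline{x} \in \mathcal{X}} [\theta_0 - \varphi_i(x_i)]\, d_i(\underline{x}) \prod_{j=1}^{k} f_j(x_j) + C_i, \tag{3.3}$$

where $\varphi_i(x_i) = E[\Theta_i \mid X_i = x_i] = h_i(x_i + 1)/h_i(x_i)$ is the posterior mean of $\Theta_i$ given
$X_i = x_i$, $h_i(x_i) = f_i(x_i)/a(x_i)$, $f_i(x_i) = \int_0^\infty f_i(x_i \mid \theta)\,dG_i(\theta) = \int_0^\infty e^{-\theta}\theta^{x_i}/x_i!\,dG_i(\theta) =
a(x_i)h_i(x_i)$ is the marginal probability function of the random variable $X_i$, and $a(x_i) =
(x_i!)^{-1}$, $h_i(x_i) = \int_0^\infty e^{-\theta}\theta^{x_i}\,dG_i(\theta)$ and $C_i = \int_{\theta_0}^\infty (\theta - \theta_0)\,dG_i(\theta)$.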

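It follows that a Bayes rule, say $d_B = (d_{1B}, \ldots, d_{kB})$, is given by: For each
$i = 1, \ldots, k$,

$$d_{iB}(\underline{x}) = \begin{cases} 1 & \text{if } \varphi_i(x_i) \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.4}$$

The minimum Bayes risk is: $r(G) = \sum_{i=1}^{k} r_i(G, d_{iB})$.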
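When the prior is fully known, rule (3.4) only requires the ratio $h_i(x+1)/h_i(x)$. The following sketch assumes, purely for illustration, a discrete two-point prior on $\theta$; the prior, the standard $\theta_0$, and the function names are hypothetical and are not part of the paper.

```python
import math

def h(x, support, weights):
    """h(x) = integral of exp(-theta) * theta**x dG(theta) for a discrete prior G."""
    return sum(w * math.exp(-t) * t ** x for t, w in zip(support, weights))

def posterior_mean(x, support, weights):
    """phi(x) = E[Theta | X = x] = h(x + 1) / h(x) for Poisson observations."""
    return h(x + 1, support, weights) / h(x, support, weights)

def bayes_select(x, support, weights, theta0):
    """Rule (3.4): select the population (return 1) when phi(x) >= theta0."""
    return 1 if posterior_mean(x, support, weights) >= theta0 else 0

support, weights = [1.0, 4.0], [0.5, 0.5]   # illustrative two-point prior
theta0 = 2.5                                 # illustrative standard
means = [posterior_mean(x, support, weights) for x in range(6)]
decisions = [bayes_select(x, support, weights, theta0) for x in range(6)]
print(decisions)
```

The posterior means increase with $x$, so the rule selects the population exactly when the observation exceeds a cutoff.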

When the prior distribution $G$ is unknown, it is not possible to apply the Bayes rule
$d_B$ for the selection problem. In the following, the empirical Bayes approach of Robbins
(1956, 1964) is employed. First, we discuss the case where certain past observations from
each of the $k$ populations are available.


3.2. Incorporating Information from Past Observations

According to the usual empirical Bayes framework, it is assumed that for each $i =
1, \ldots, k$, there are marginally iid past random observations $X_{i1}, \ldots, X_{in}$ with marginal
probability function $f_i(x)$ available when a decision is made. Three empirical Bayes selection
rules are constructed according to how much we know about the prior distribution
$G$.

3.2.1. A Nonparametric Empirical Bayes Rule


It is assumed that the prior distribution $G$ is completely unknown. Thus, a nonparametric
empirical Bayes approach is employed. It should be noted that $\varphi_i(x_i)$ is increasing
in $x_i$ for each $i = 1, \ldots, k$. Therefore the Bayes rule $d_B$ is a monotone selection rule. Thus,
it is desirable that the considered empirical Bayes rule be also monotone.

For each $i = 1, \ldots, k$, and $x = 0, 1, 2, \ldots$, define

$$f_{in}(x) = n^{-1} \sum_{j=1}^{n} I_{\{x\}}(X_{ij}), \qquad h_{in}(x) = f_{in}(x)/a(x).$$

Let $N_{in} = \max_{1 \le j \le n} X_{ij} - 1$ and for each $x = 0, 1, \ldots, N_{in}$, define

$$\varphi_{in}(x) = [h_{in}(x + 1) + \delta_n]/[h_{in}(x) + \delta_n],$$

where $\delta_n > 0$ is such that $\delta_n = o(1)$.

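Since $\varphi_{in}(x)$ may not be increasing in $x$, a smoothed version of $\varphi_{in}(x)$ is given below.
Let $\{\varphi^*_{in}(x)\}_{x=0}^{N_{in}}$ be the isotonic regression of $\{\varphi_{in}(x)\}_{x=0}^{N_{in}}$ with random weights
$\{W_{in}(x)\}_{x=0}^{N_{in}}$, where $W_{in}(x) = [h_{in}(x) + \delta_n]\,a(x + 1)$. For $y > N_{in}$, let $\varphi^*_{in}(y) = \varphi^*_{in}(N_{in})$.
Therefore, $\varphi^*_{in}(x)$ is nondecreasing in $x$. We may use $\varphi^*_{in}(x)$ to estimate $\varphi_i(x)$. Based on
$\varphi^*_{in}(x)$, $i = 1, \ldots, k$, an empirical Bayes rule $d^*_n = (d^*_{1n}, \ldots, d^*_{kn})$ is proposed as follows:
For each $i = 1, \ldots, k$, and $\underline{x} \in \mathcal{X}$,

$$d^*_{in}(\underline{x}) = \begin{cases} 1 & \text{if } \varphi^*_{in}(x_i) \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.5}$$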
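The construction leading to (3.5) can be sketched as follows for a single population. The isotonic regression is computed with the pool-adjacent-violators algorithm, which is a standard way to obtain a weighted isotonic fit but is not named in the text; the choice of `delta`, the sample data, and all function names are assumptions made for illustration. The sketch also assumes that the largest past observation is at least 1, so that $N_{in} \ge 0$.

```python
import math
from collections import Counter

def pava(values, weights):
    """Weighted isotonic (nondecreasing) regression by pool-adjacent-violators."""
    blocks = []                                  # each block: [mean, weight, count]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * v1 + w2 * v2) / wt, wt, c1 + c2])
    fitted = []
    for v, _, count in blocks:
        fitted.extend([v] * count)
    return fitted

def smoothed_posterior_means(past, delta):
    """Return [phi*_in(0), ..., phi*_in(N_in)] from past observations."""
    n, counts, top = len(past), Counter(past), max(past)   # top = N_in + 1
    if top < 1:
        raise ValueError("need max(past) >= 1 so that N_in >= 0")
    h = [counts[x] / n * math.factorial(x) for x in range(top + 1)]   # f_in(x)/a(x)
    raw = [(h[x + 1] + delta) / (h[x] + delta) for x in range(top)]
    w = [(h[x] + delta) / math.factorial(x + 1) for x in range(top)]
    return pava(raw, w)

def select(smoothed, x, theta0):
    """Rule (3.5); phi*_in is held constant beyond N_in."""
    return 1 if smoothed[min(x, len(smoothed) - 1)] >= theta0 else 0

past = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]           # illustrative past data
smoothed = smoothed_posterior_means(past, delta=0.1)
print(smoothed, select(smoothed, 3, theta0=2.0))
```

Because the fitted values are nondecreasing, the resulting selection rule is monotone in the current observation, as required in the text.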


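3.2.2. A Parametric Empirical Bayes Rule

It is assumed that the prior distribution $G_i$ is the gamma distribution with unknown
shape and scale parameters $\alpha_i$ and $\beta_i$, respectively, $i = 1, \ldots, k$. That is, $G_i$ has a density
function $g_i(\theta \mid \alpha_i, \beta_i)$, where

$$g_i(\theta \mid \alpha_i, \beta_i) = \beta_i^{\alpha_i}\theta^{\alpha_i - 1}e^{-\beta_i\theta}/\Gamma(\alpha_i), \quad \theta > 0.$$

Then, $X_{i1}, \ldots, X_{in}$ are iid with marginal probability function $f_i(x) = \Gamma(x + \alpha_i)\beta_i^{\alpha_i}/
[\Gamma(\alpha_i)(1 + \beta_i)^{x + \alpha_i}x!]$, $x = 0, 1, 2, \ldots$. Also, $\varphi_i(x) = (x + \alpha_i)/(1 + \beta_i)$. Straightforward computations
yield that $\mu_{i1} = E[X_{i1}] = \alpha_i/\beta_i$, $\mu_{i2} = E[X_{i1}^2] = \alpha_i(\alpha_i + 1)\beta_i^{-2} + \alpha_i\beta_i^{-1}$. Thus, $\beta_i =
\mu_{i1}(\mu_{i2} - \mu_{i1} - \mu_{i1}^2)^{-1}$ and $\alpha_i = \mu_{i1}^2(\mu_{i2} - \mu_{i1} - \mu_{i1}^2)^{-1}$. Therefore, $\varphi_i(x) = [x(\mu_{i2} - \mu_{i1} - \mu_{i1}^2) + \mu_{i1}^2]/[\mu_{i2} - \mu_{i1}^2]$.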
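For each $i = 1, \ldots, k$, let $\mu_{i1n} = n^{-1}\sum_{j=1}^{n} X_{ij}$ and $\mu_{i2n} = n^{-1}\sum_{j=1}^{n} X_{ij}^2$. That is, $\mu_{i1n}$
and $\mu_{i2n}$ are moment estimators of $\mu_{i1}$ and $\mu_{i2}$, respectively. It is possible that
$T_{in} = \mu_{i2n} - \mu_{i1n} - \mu_{i1n}^2 \le 0$ even though $\mu_{i2} - \mu_{i1} - \mu_{i1}^2 > 0$. Thus, for each $x = 0, 1, \ldots$, define

$$\hat{\varphi}_{in}(x) = \begin{cases} \dfrac{x T_{in} + \mu_{i1n}^2}{\mu_{i2n} - \mu_{i1n}^2} & \text{if } T_{in} > 0, \\ x & \text{otherwise.} \end{cases} \tag{3.6}$$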

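Then, an empirical Bayes rule $\hat{d}_n = (\hat{d}_{1n}, \ldots, \hat{d}_{kn})$ is proposed as follows: For each
$i = 1, \ldots, k$, and $\underline{x} \in \mathcal{X}$,

$$\hat{d}_{in}(\underline{x}) = \begin{cases} 1 & \text{if } \hat{\varphi}_{in}(x_i) \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.7}$$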
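The moment-based estimate (3.6) and the rule (3.7) translate directly into code. The sketch below handles one population; the function names and the sample data are hypothetical.

```python
def parametric_eb_estimate(past, x):
    """Moment-based estimate of the posterior mean at x, following (3.6)."""
    n = len(past)
    m1 = sum(past) / n                       # first sample moment
    m2 = sum(v * v for v in past) / n        # second sample moment
    t = m2 - m1 - m1 * m1                    # estimates the prior variance alpha / beta^2
    if t > 0:
        return (x * t + m1 * m1) / (m2 - m1 * m1)
    return float(x)                          # fallback when the estimate is not positive

def parametric_eb_select(past, x, theta0):
    """Rule (3.7): select when the estimated posterior mean is at least theta0."""
    return 1 if parametric_eb_estimate(past, x) >= theta0 else 0

overdispersed = [0, 2, 4, 6, 8]              # m1 = 4, m2 = 24, t = 4
underdispersed = [2, 2, 2, 2]                # t = -2, so the fallback applies
print(parametric_eb_estimate(overdispersed, 6))
print(parametric_eb_estimate(underdispersed, 3))
print(parametric_eb_select(overdispersed, 6, theta0=4.5))
```

For the first data set the implied gamma parameters are $\alpha = 4$ and $\beta = 1$, so the estimate at $x = 6$ is $(6 + 4)/(1 + 1) = 5$. The second data set has sample variance below the sample mean, which triggers the fallback branch.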


3.2.3.  A  Hierarchical  Empirical  Bayes  Rule 


Suppose that the prior distribution $G_i$ is a gamma distribution with a known shape
parameter $\alpha_i$ and an unknown scale parameter $\beta_i$. In this situation, the preceding parametric
empirical Bayes approach can still be applied. However, a new method, called
hierarchical empirical Bayes, is introduced in the following.


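Since $\beta_i$ is a scale parameter, it is assumed that $\beta_i$ has an improper prior $p(\beta_i) =
1/\beta_i$, $\beta_i > 0$. Thus, conditional on $\beta_i$, $X_{i1}, \ldots, X_{in}$ are iid with the probability function
$f_i(x \mid \beta_i) = \int_0^\infty f_i(x \mid \theta)g_i(\theta \mid \alpha_i, \beta_i)\,d\theta = \Gamma(x + \alpha_i)\beta_i^{\alpha_i}/[\Gamma(\alpha_i)x!(1 + \beta_i)^{x + \alpha_i}]$, $x = 0, 1, 2, \ldots$. Therefore,
$X_{i1}, \ldots, X_{in}$ has a joint marginal probability function $f_i(x_{i1}, \ldots, x_{in})$, where

$$f_i(x_{i1}, \ldots, x_{in}) = \int_0^\infty \prod_{j=1}^{n} f_i(x_{ij} \mid \beta_i)\,p(\beta_i)\,d\beta_i = \prod_{j=1}^{n}\left[\frac{\Gamma(x_{ij} + \alpha_i)}{\Gamma(\alpha_i)x_{ij}!}\right] \frac{\Gamma(n\alpha_i)\Gamma(b_i - n\alpha_i)}{\Gamma(b_i)},$$

where $b_i = n\alpha_i + \sum_{j=1}^{n} x_{ij}$. Thus, the posterior density function of $\beta_i$ given $(X_{i1}, \ldots, X_{in}) = (x_{i1}, \ldots, x_{in})$ is

$$p(\beta_i \mid x_{i1}, \ldots, x_{in}) = \beta_i^{n\alpha_i - 1}(1 + \beta_i)^{-b_i}\,\Gamma(b_i)/[\Gamma(n\alpha_i)\Gamma(b_i - n\alpha_i)],$$

and the posterior mean of $\beta_i$ given $(x_{i1}, \ldots, x_{in})$ is

$$\beta_{in} = E[\beta_i \mid x_{i1}, \ldots, x_{in}] = \begin{cases} \dfrac{n\alpha_i}{\sum_{j=1}^{n} x_{ij} - 1} & \text{if } \sum_{j=1}^{n} x_{ij} \ge 2, \\ \infty & \text{otherwise.} \end{cases}$$

For each $i = 1, \ldots, k$, and $(x_{i1}, \ldots, x_{in})$, define

$$\tilde{\varphi}_{in}(x_i) = \begin{cases} (x_i + \alpha_i)/(1 + \beta_{in}) & \text{if } \sum_{j=1}^{n} x_{ij} \ge 2, \\ 0 & \text{otherwise.} \end{cases} \tag{3.8}$$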


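We then propose an empirical Bayes rule $\tilde{d}_n = (\tilde{d}_{1n}, \ldots, \tilde{d}_{kn})$ as follows: For each
$i = 1, \ldots, k$,

$$\tilde{d}_{in}(\underline{x}) = \begin{cases} 1 & \text{if } \tilde{\varphi}_{in}(x_i) \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.9}$$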
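A minimal sketch of the hierarchical rule for one population is given below. It assumes the shape parameter is known and plugs the posterior mean of the scale parameter, $n\alpha_i/(\sum_j x_{ij} - 1)$, into the gamma-Poisson posterior mean $(x + \alpha_i)/(1 + \beta_i)$; the function names and the data are hypothetical.

```python
def hierarchical_eb_estimate(past, x, alpha):
    """Estimate of the posterior mean at x with known shape alpha."""
    total = sum(past)
    if total >= 2:
        beta_hat = len(past) * alpha / (total - 1)   # posterior mean of beta
        return (x + alpha) / (1.0 + beta_hat)
    return 0.0                                       # posterior mean of beta is infinite

def hierarchical_eb_select(past, x, alpha, theta0):
    """Select (return 1) when the estimate is at least theta0, as in (3.9)."""
    return 1 if hierarchical_eb_estimate(past, x, alpha) >= theta0 else 0

value = hierarchical_eb_estimate([1, 2, 3], x=4, alpha=2.0)   # beta_hat = 6 / 5
sparse = hierarchical_eb_estimate([0, 1, 0], x=4, alpha=2.0)  # total < 2
print(value, sparse, hierarchical_eb_select([1, 2, 3], 4, 2.0, theta0=2.5))
```

When the past observations sum to less than 2, the estimate is set to 0 and the population is never selected for a positive standard.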


3.2.4.  Asymptotic  Optimality 

For an empirical Bayes selection rule $d_n$, let $r(G, d_n)$ denote the overall Bayes risk.
That is,

$$r(G, d_n) = \sum_{i=1}^{k}\left\{\sum_{x_i=0}^{\infty} [\theta_0 - \varphi_i(x_i)]\,E_{in}[d_{in}(x_i)]\,f_i(x_i) + C_i\right\}$$

where the expectation $E_{in}$ is taken with respect to $(X_{i1}, \ldots, X_{in})$. Since $r(G)$ is the
minimal Bayes risk, $r(G, d_n) - r(G) \ge 0$ for all $n$. The nonnegative difference $r(G, d_n) - r(G)$
can be used as a measure of optimality of the empirical Bayes rule $d_n$.

Definition 3.1. Let $\{d_n\}_{n=1}^{\infty}$ be a sequence of empirical Bayes rules. $\{d_n\}_{n=1}^{\infty}$ is said to be
asymptotically optimal of order $r_n$ relative to the prior distribution $G$ if $r(G, d_n) - r(G) =
O(r_n)$, where $\{r_n\}_{n=1}^{\infty}$ is a sequence of positive numbers such that $\lim_{n \to \infty} r_n = 0$.
Following Gupta and Liang (1989c), it is easy to obtain the following result. Let
$B_i(\theta_0) = \{x \mid \varphi_i(x) < \theta_0\}$ and let

$$m_i = \begin{cases} \max B_i(\theta_0) & \text{if } B_i(\theta_0) \ne \emptyset, \\ -1 & \text{otherwise.} \end{cases}$$

Theorem 3.1. Let $d_n$ denote any of the three empirical Bayes
selection rules $d^*_n$, $\hat{d}_n$ and $\tilde{d}_n$ constructed above. Suppose that $\int_0^\infty \theta\,dG_i(\theta) < \infty$ and $m_i < \infty$ for all $i =
1, \ldots, k$. Then, $r(G, d_n) - r(G) = O(\exp(-cn))$ for some positive constant $c$, where the
value of $c$ varies depending on the empirical Bayes selection rule used.

3.3.  Incorporating  Information  from  Other  Components 

We now consider the case where it is assumed that the $k$ prior distributions $G_1, \ldots, G_k$
are identical, but there are no past observations available. Under this assumption, $X_1, \ldots, X_k$
are marginally iid with probability function $f(x) = \int_0^\infty e^{-\theta}\theta^x/x!\,dG(\theta)$, where $G = G_1 =
\cdots = G_k$. Therefore, we can still share information across the components to improve
the decisions for each of the $k$ component decision problems. The idea is described again
through the nonparametric empirical Bayes, the parametric empirical Bayes and the hierarchical
empirical Bayes approaches.

3.3.1.  A  Nonparametric  Empirical  Bayes  Rule 

It  is  assumed  that  the  prior  distribution  G  is  completely  unknown.  Following  the 
discussion  of  Subsection  3.2.1,  a  nonparametric  empirical  Bayes  selection  rule  is  constructed 
as  follows. 

For each $i = 1, \ldots, k$, let $N_{ik} = \max_{j \ne i} X_j - 1$, and let $f_{ik}(y) = (k - 1)^{-1}\sum_{j \ne i} I_{\{y\}}(X_j)$, $h_{ik}(y) =
f_{ik}(y)/a(y)$, $y = 0, 1, \ldots$. Also, let $\varphi_{ik}(y) = [h_{ik}(y + 1) + \delta_k]/[h_{ik}(y) + \delta_k]$ for each
$y = 0, 1, \ldots, N_{ik}$, where $\delta_k > 0$ is such that $\delta_k = o(1)$.

Let $\{\varphi^*_{ik}(y)\}_{y=0}^{N_{ik}}$ be the isotonic regression of $\{\varphi_{ik}(y)\}_{y=0}^{N_{ik}}$ with random weights $\{W_{ik}(y)\}_{y=0}^{N_{ik}}$,
where $W_{ik}(y) = [h_{ik}(y) + \delta_k]\,a(y + 1)$. For $y > N_{ik}$, let $\varphi^*_{ik}(y) = \varphi^*_{ik}(N_{ik})$. Now, an empirical
Bayes rule $d^*_k = (d^*_{1k}, \ldots, d^*_{kk})$ is proposed as follows: For each $i = 1, \ldots, k$, and
$(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$, define

$$d^*_{ik}(\underline{x}) = \begin{cases} 1 & \text{if } \varphi^*_{ik}(x_i) \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.10}$$


3.3.2.  A  Parametric  Empirical  Bayes  Rule 

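It is assumed that the prior distribution $G$ is a member of the gamma distribution family
with probability density function $g(\theta \mid \alpha, \beta)$, where

$$g(\theta \mid \alpha, \beta) = \beta^{\alpha}\theta^{\alpha - 1}e^{-\beta\theta}/\Gamma(\alpha), \quad \theta > 0,$$

and both the parameters $\alpha$ and $\beta$ are unknown. Following the discussion of Subsection
3.2.2, for each $i = 1, \ldots, k$, let $\mu_{1k}(i) = (k - 1)^{-1}\sum_{j \ne i} X_j$ and $\mu_{2k}(i) = (k - 1)^{-1}\sum_{j \ne i} X_j^2$. Let
$T_{ik} = \mu_{2k}(i) - \mu_{1k}(i) - \mu_{1k}^2(i)$. Define

$$\hat{\varphi}_{ik}(x_i) = \begin{cases} \dfrac{x_i T_{ik} + \mu_{1k}^2(i)}{\mu_{2k}(i) - \mu_{1k}^2(i)} & \text{if } T_{ik} > 0, \\ x_i & \text{otherwise.} \end{cases} \tag{3.11}$$

An empirical Bayes rule $\hat{d}_k = (\hat{d}_{1k}, \ldots, \hat{d}_{kk})$ is proposed as follows: For each $i = 1, \ldots, k$
and $(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$, define

$$\hat{d}_{ik}(\underline{x}) = \begin{cases} 1 & \text{if } \hat{\varphi}_{ik}(x_i) \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.12}$$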


3.3.3.  A  Hierarchical  Empirical  Bayes  Rule 

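It is assumed that the prior distribution $G$ is a gamma distribution with a known
shape parameter $\alpha$ and an unknown scale parameter $\beta$. As in Subsection
3.2.3, a hierarchical empirical Bayes rule $\tilde{d}_k = (\tilde{d}_{1k}, \ldots, \tilde{d}_{kk})$ is constructed as follows.

For given $(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$, let

$$\beta_k = \begin{cases} \dfrac{k\alpha}{\sum_{j=1}^{k} x_j - 1} & \text{if } \sum_{j=1}^{k} x_j \ge 2, \\ \infty & \text{otherwise.} \end{cases}$$

For each $i = 1, \ldots, k$, and $(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$, define

$$\tilde{\varphi}_{ik}(x_i) = \begin{cases} (x_i + \alpha)/(1 + \beta_k) & \text{if } \sum_{j=1}^{k} x_j \ge 2, \\ 0 & \text{otherwise.} \end{cases} \tag{3.13}$$

Define, for each $i = 1, \ldots, k$, and $(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$,

$$\tilde{d}_{ik}(\underline{x}) = \begin{cases} 1 & \text{if } \tilde{\varphi}_{ik}(x_i) \ge \theta_0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.14}$$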


3.3.4.  Asymptotic  Optimality 

Let $d_k$ denote any of the three empirical Bayes selection rules constructed above.
The associated overall Bayes risk $r(G, d_k)$ is:

$$r(G, d_k) = \sum_{i=1}^{k} r_i(G, d_{ik}),$$

where

$$r_i(G, d_{ik}) = E_{ik}E_i[(\theta_0 - \varphi_i(X_i))\,d_{ik}(\underline{X})] + C,$$

where the expectation $E_i$ is taken with respect to $X_i$ and the expectation $E_{ik}$ is taken
with respect to $(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_k)$. Also, here $C = \int_{\theta_0}^{\infty}(\theta - \theta_0)\,dG(\theta)$.

Since $r(G)$ is the minimal Bayes risk, $r(G, d_k) - r(G) \ge 0$ for all $k$.

Definition  3.2. 

(a) A selection rule $d_k$ is said to be weakly asymptotically optimal relative to the prior
distribution $G$ if

$$[r(G, d_k) - r(G)]/k \to 0 \text{ as } k \to \infty.$$

(b) A selection rule $d_k$ is said to be strongly asymptotically optimal relative to the prior
distribution $G$ if

$$r(G, d_k) - r(G) \to 0 \text{ as } k \to \infty.$$


Note that strong asymptotic optimality implies weak asymptotic optimality.
The weak asymptotic optimality of compound decision rules has been studied in the literature
by many authors, notably Vardeman (1978, 1980), Gilliland and Hannan (1986), and
Gilliland, Hannan and Hwang (1976), though the formulations of their compound decision
problems are different from the one described previously. For the present problem, Gupta
and Liang (1989c) obtained the following strong asymptotic optimality.


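Let $B(\theta_0) = \{x \mid \varphi(x) < \theta_0\}$, where $\varphi(x) = \varphi_1(x) = \cdots = \varphi_k(x)$ since $G_1 = \cdots = G_k$,
and let

$$m = \begin{cases} \max B(\theta_0) & \text{if } B(\theta_0) \ne \emptyset, \\ -1 & \text{otherwise.} \end{cases}$$

Theorem 3.3. Let $d_k$ denote any of the three empirical Bayes
selection rules $d^*_k$, $\hat{d}_k$ and $\tilde{d}_k$ constructed above. Suppose that $\int_0^\infty \theta\,dG(\theta) < \infty$ and $m < \infty$. Then, $r(G, d_k) -
r(G) = O(\exp(-ck + \ln k))$ for some positive constant $c$, where the value of $c$ varies
depending on the empirical Bayes rule used.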


4.  Selection  of  Variables  in  Linear  Regression 

In applying regression analysis in practical situations for prediction purposes such as
economic forecasting or weather prediction, one is faced with a large number of independent
variables. In such situations, it may well be sufficient to consider only a subset of
these predictor variables for an “adequate” prediction. Thus arises a problem of choosing a
“good” subset of these variables. Hocking (1976) and Thompson (1978a, b) have reviewed
several criteria and techniques that have been used in practice. However, these procedures
are ad hoc in nature and are not designed to control the probability of selecting the important
variables. McCabe and Arvesen (1974) and Arvesen and McCabe (1975) were the first
to formulate the problem in the framework of Gupta-type subset selection by considering
models involving all possible subsets of an arbitrarily chosen size. Huang and Panchapakesan
(1982) considered a different formulation taking into consideration all possible reduced
models. Using different criteria for comparing any reduced model with the “true” model,
this problem was also investigated by Hsu and Huang (1982), who used a sequential procedure,
and by Gupta, Huang and Chang (1984), who used simultaneous tests of a family of
hypotheses in constructing their procedure. Recently, Gupta and Huang (1988, 1989) have
further studied this problem. We discuss their results below.
Consider the standard linear model

$$\underline{Y} = X\underline{\beta} + \underline{\varepsilon} \tag{4.1}$$

where $\underline{Y}' = (Y_1, \ldots, Y_n)$ is an $n$-vector of random observations, $X = [\underline{1}, \underline{X}_1, \ldots, \underline{X}_{p-1}]$
is an $n \times p$ matrix of known constants, $\underline{\beta}' = (\beta_0, \beta_1, \ldots, \beta_{p-1})$ is a $p$-vector of unknown
parameters, and $\underline{\varepsilon} \sim N(\underline{0}, \sigma_0^2 I_n)$. Here $\underline{1}$ is a column vector of 1's and $I_n$ is an $n \times n$ identity
matrix. The model (4.1) with $p - 1$ independent variables is considered as the “true” model.
Any reduced model whose “$X$ matrix” has $r$ columns is obtained by retaining any $r - 1$
of the $p - 1$ independent variables $X_1, X_2, \ldots, X_{p-1}$, where $2 \le r \le p$. For each $r$, there
are $k_r = \binom{p-1}{r-1}$ such models, which are indexed arbitrarily $i = 1, \ldots, k_r$. A typical model
from this group will be referred to as $M_{ri}$, which can be written as

$$E(\underline{Y}) = X_{ri}\underline{\beta}_{ri} \tag{4.2}$$

where $X_{ri}$ and $\underline{\beta}_{ri}$ are obtained from $X$ and $\underline{\beta}$, respectively, corresponding to the variables
that are retained in the model. In our discussion, all expectations and probabilities are
calculated under the true model (4.1).

Let $SS_{ri}$ denote the residual sum of squares for the reduced model $M_{ri}$, $1 \le i \le k_r$,
$2 \le r \le p$. Then

$$SS_{ri}/\sigma^2 \sim \chi^2\{\nu_r, \lambda_{ri}\} \tag{4.3}$$

where $\nu_r = n - r$ is the degrees of freedom and $\lambda_{ri}$ is the noncentrality parameter. This
gives

$$E(SS_{ri}) = \nu_r\sigma^2 + 2\sigma^2\lambda_{ri}. \tag{4.4}$$

Since $\sigma^2$ is fixed, it is clear from (4.4) that $\lambda_{ri}$ should not be large for a good model. This
motivates the criterion employed by Gupta and Huang (1988), namely, any reduced model
$M_{ri}$ with the associated noncentrality parameter $\lambda_{ri}$ is defined to be inferior if $\lambda_{ri} \ge \Delta$,
where $\Delta > 0$ is a specified constant. The goal is to eliminate all inferior models from the
set of $2^{p-1} - 1$ regression models including the true model. For this goal, Gupta and Huang
(1988) proposed and studied a two-stage procedure. In the first stage, inferior models are
eliminated. Then, in the second stage, one of the models from the retained set (if it has
more than one) is selected.


Consider, as an estimator of $\lambda_{ri}$,

$$\hat\lambda_{ri} = \frac{n-p}{2}\,\frac{SS_{ri}}{SS_{p1}} - \frac{\nu_r}{2}
= \frac{n-p}{2}\,\frac{1 - R_{ri}^2}{1 - R^2} - \frac{\nu_r}{2} \tag{4.5}$$

where $R^2$ and $R_{ri}^2$ are the squared multiple correlation coefficients of the models (4.1) and (4.2),
respectively. Define, for $n - p > 2$,


fri=2  "  *  2[2X„  +  (p-r)l-(2p-3r).  (4.6) 

n-p 

Gupta and Huang (1988) have shown that $\hat\Gamma_{ri}$ is an unbiased estimator of
$\Gamma_{ri} = E(SS_{ri})/\sigma^2 - (n - 2r)$, which is the standardized total squared error.
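The estimator in (4.5) can be computed directly from residual sums of squares. The following is a minimal sketch, assuming ordinary least squares with an intercept and a small pure-Python solver; the helper names (`solve`, `rss`, `lambda_hats`) and the simulated data are illustrative assumptions, not part of Gupta and Huang's work.

```python
from itertools import combinations
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(size):
        pivot = max(range(c, size), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, size):
            f = m[r][c] / m[c][c]
            for j in range(c, size + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(m[r][j] * x[j] for j in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus the given columns."""
    rows = [[1.0] + [col[t] for col in columns] for t in range(len(y))]
    q = len(rows[0])
    xtx = [[sum(row[u] * row[v] for row in rows) for v in range(q)] for u in range(q)]
    xty = [sum(row[u] * yt for row, yt in zip(rows, y)) for u in range(q)]
    beta = solve(xtx, xty)
    return sum((yt - sum(bu * xu for bu, xu in zip(beta, row))) ** 2
               for row, yt in zip(rows, y))


def lambda_hats(columns, y):
    """Map (r, retained variable indices) to the estimate in (4.5)."""
    n, p = len(y), len(columns) + 1
    ss_full = rss(columns, y)  # SS_{p1}: residual sum of squares of the true model
    estimates = {}
    for r in range(2, p + 1):
        for keep in combinations(range(p - 1), r - 1):
            ss = rss([columns[j] for j in keep], y)
            estimates[(r, keep)] = (n - p) / 2 * ss / ss_full - (n - r) / 2
    return estimates


random.seed(0)
n_obs = 40
x_cols = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(3)]
# Hypothetical data: only the first independent variable has a nonzero coefficient.
y_obs = [1.0 + 2.0 * x_cols[0][t] + random.gauss(0, 1) for t in range(n_obs)]
estimates = lambda_hats(x_cols, y_obs)
for key in sorted(estimates):
    print(key, round(estimates[key], 2))
```

For the true model ($r = p$) the estimate is zero by construction, and reduced models that drop the influential variable receive much larger estimates, which is the behavior the elimination stage relies on.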

The two-stage procedure $R_s$ of Gupta and Huang (1988) is as follows:

$R_s$: At stage 1, eliminate all models $M_{ri}$ for which

$$\hat\lambda_{ri} \ge d_r \tag{4.7}$$

and at stage 2, select from all the models that are retained after stage 1 the one with the
smallest $\hat\Gamma_{ri}$. The constant $d_r$ in (4.7) is chosen to satisfy


Dr  = 


2 

n-p 


n~P 
n  —  r 


(4.8) 


where $D_r$ is the $100(1 - P^*)$ percent point of the noncentral $F$ distribution with $p - r$ and
$n - p$ degrees of freedom and noncentrality parameter $\Delta$. It can be shown that, for the
rule $R_s$,

$$P\{\text{all inferior models } M_{ri} \text{ are eliminated}\} \ge P^*.$$


Several authors have studied the influence on the fitted regression line when a part of
the data is deleted. In the model (4.1), let $\hat\beta$ denote the usual least squares estimator of $\beta$
based on the full data and let $\hat\beta_A$ be the least squares estimator based on a subset of the
data. An empirical influence function for $\hat\beta$ is $IF_A = \hat\beta_A - \hat\beta$. For a given positive definite
matrix $M$ and a nonzero scale factor $c$, Cook and Weisberg (1980) defined a distance
$D_A(M, c)$ between $\hat\beta$ and $\hat\beta_A$ given by:

$$D_A(M, c) = \frac{(\hat\beta_A - \hat\beta)'\, M\, (\hat\beta_A - \hat\beta)}{c},$$

where $M$ can be chosen to reflect specific interests. Recently, Gupta and Huang (1989)
have integrated this concept of influential data with their procedure for selecting important
independent variables discussed previously. They have considered deleting one observation
at a time from the data set $Y$. Recalling that $M_{ri}$ denotes a reduced model obtained by
retaining $r - 1$ of the $p - 1$ independent variables, let $M_{ri}(\ell)$ denote the model obtained
from $M_{ri}$ by deleting the $\ell$-th observation in $Y$. Corresponding to $\lambda_{ri}$ in (4.3) associated
with the model $M_{ri}$, we have the noncentrality parameter $\lambda_{ri}(\ell)$ associated with the model
$M_{ri}(\ell)$. Analogous to $\hat\lambda_{ri}$ of (4.5) for the model $M_{ri}$, we define, in the case of $M_{ri}(\ell)$,

$$\hat\lambda_{ri}(\ell) = \frac{n-p-1}{2}\,\frac{SS_{ri}(\ell)}{SS_{p1}(\ell)} - \frac{n-r-1}{2}. \tag{4.9}$$


We  can  find  a  constant  d'r  such  that 


=  (4.10) 

The new two-stage procedure $R_s'$ of Gupta and Huang (1989) is defined exactly as their
earlier procedure $R_s$ except that, in stage 1, a model $M_{ri}$ is eliminated if


Ki(t)  >  f°r  some  t  for  which  <  d'p 


(4.11) 


instead  of  (4.7). 

5.  Sequential  Selection  Procedures 

A substantial amount of original research on sequential selection procedures accomplished
during the early years of the ranking and selection theory was published as a
monograph by Bechhofer, Kiefer and Sobel (1968). These and subsequent developments
have been discussed in Gupta and Panchapakesan (1979), who have recently (1990) reviewed
further developments in the sequential selection theory. In our present discussion,
we will confine our attention to a few specific recent results.


5.1.  A  Subset  Selection  Procedure  with  a  New  Goal


Let $\pi_1, \ldots, \pi_k$ be $k$ independent normal populations with unknown means $\theta_1, \ldots, \theta_k$,
respectively, and a common known variance $\sigma^2$. For a specified $\delta^* > 0$, any population $\pi_i$
for which $\theta_i > \max_{1 \le j \le k} \theta_j - \delta^*$ is defined to be a good population. Gupta and Liang
(1988c) considered the goal of selecting a subset of the $k$ populations which includes the
best population (the one associated with the largest $\theta_i$) and at the same time excludes all
that are not good. An event of selecting a subset consistent with this goal is denoted by
$CS(\delta^*)$. This is different from what is known as $\delta^*$-correct selection in the literature.


Let $X_{i1}, X_{i2}, \ldots$ be a sequence of independent observations from $\pi_i$, $i = 1, \ldots, k$.
For $m \ge 1$, define $Y_{im} = \sum_{j=1}^{m} X_{ij}$. Let $S_m$ denote the set of contending populations at
the beginning of stage $m$ and let $|S_m|$ denote the size of $S_m$. Gupta and Liang (1988c)
proposed and studied the following procedure.


$R_{GL}$: Choose $\delta_1$ in $(0, \delta^*/2)$. At stage $m$ ($m = 1, 2, \ldots$), take one observation
from each population in $S_m$. Include in $S_{m+1}$ only those $\pi_i$'s in $S_m$ for which

0i ...  „  .  m0?  ,  k  - 1  , 

0  (Frm  —  Yim)  ~  .  <  1°8  ”  for  G  Sm,  r  ^  t, 

l  4  1-p* 

and eliminate all other $\pi_i$'s from any further consideration. Now, label as good only those
$\pi_i$'s in $S_{m+1}$ that have not been labeled yet and for which

+  **  /,,  „  x  .  m(0*a  -*?).,  *  -  1  „  .. 

— 5 — ( Yim  -  Ytm)  + - - - *-  >  log  - - -  for  all  JT,  €  5m+i,  t  ^  t. 

*  4  1  —  p 

Stop sampling if either $|S_{m+1}| = 1$ or $S_{m+1}$ does not contain any unlabeled population,
and make the terminal decision: "Select all populations in $S_{m+1}$"; otherwise, go to stage
$m + 1$.


It should be noted that a population is not labeled until and unless it qualifies to be
called good. Any population, once labeled, is not examined for labeling again. However,
it is possible that a labeled population is eliminated subsequently. The populations that
are selected by the terminal decision are precisely those which have been found good at
some stage and which have survived elimination. The choice of $\delta_1$ in $(0, \delta^*/2)$ assures that
the sequential procedure terminates with probability one. The procedure guarantees a
minimum probability $P^*$ for selecting a subset consistent with the goal. An optimal choice
of $\delta_1$, however, is an open question.

Finally,  it  should  be  pointed  out  that  Gupta  and  Liang  (1988c)  have  discussed  the 
procedure  more  generally  for  location  and  scale  parameter  families. 

5.2.  Selection  Procedures  for  the  Exponential  Family 

Gupta  and  Miescke  (1984)  studied  sequential  selection  for  exponential  family  under 
a  decision-theoretic  framework.  Their  treatment  includes  multi-stage  selection  and  their 
results  relate  to  selection  of  subsets  of  random  as  well  as  fixed  sizes. 

Consider the one-parameter exponential family $\mathcal{F}$ given by

$$\mathcal{F} = \{c(\theta)\exp(\theta x)h(x),\ x \in \mathbb{R}\}_{\theta \in \Theta}$$

where $\Theta \subset \mathbb{R}$ is an interval. We consider the class of permutation invariant sequential procedures
with or without elimination, employing vector-at-a-time sampling, which means
that a vector of observations (one from each) is taken from the non-eliminated populations.
Let $X_{i1}, X_{i2}, \ldots$ be a sequence of observations from $\pi_i$ (with associated parameter
$\theta_i$). At stage $m$ ($m = 1, 2, \ldots$), let $n_m$ observations be taken from eligible populations. Let
$W_{im} = \sum_{j=1}^{N_m} X_{ij}$, where $N_m = \sum_{j=1}^{m} n_j$, be a sufficient statistic for $\theta_i$, based on all observations
from $\pi_i$ through stage $m$, and let $W_m = (W_{1m}, \ldots, W_{km})$, $m = 1, 2, \ldots$.

Let $t_j$, $j = 1, \ldots, m$, denote the subset of $\{\pi_1, \ldots, \pi_k\}$ that is eliminated at stage
$j$, and $t_{m+1}$ denote the subset finally selected at termination. This yields a partition
$\{t_1, \ldots, t_{m+1}\}$ of $\{\pi_1, \ldots, \pi_k\}$ which will be called a record. For $\theta = (\theta_1, \ldots, \theta_k) \in \Omega = \Theta^k$,
$L_m(\theta, t_1, \ldots, t_{m+1})$ denotes the loss incurred when the procedure stops at stage
$m$ with the record $\{t_1, \ldots, t_m, t_{m+1}\}$. It is assumed that (a) $L_m$ is permutation invariant,
and (b) $L_m$ increases if a record is changed so that a better population is eliminated before
an inferior one.

A natural terminal decision, at stage $m$, selects only those populations among the
noneliminated ones which yielded the largest values of $W_{im}$. Gupta and Miescke (1984)
have shown that between two procedures which differ only in their terminal decisions, the
procedure that employs a natural terminal decision rule has a smaller risk.

One can naturally speculate that, within stages where a procedure with elimination
does not stop, natural subset selections are optimal as in the case of terminal decisions.
This has been shown to be true by Gupta and Miescke (1984) only in the case of multi-stage
procedures with sizes of the subsets selected at each stage fixed, under the assumption that
$\mathcal{F}$ is strongly unimodal (i.e., the exponential family density is log-concave). For additional comments,
see Miescke (1984).

For the exponential family, Liang (1988) considered the goal of selecting the best
population and excluding all that are not good (same goal as that of $R_{GL}$ discussed
in Section 5.1). His sequential procedure with elimination is based on certain conditional
likelihood functions and it achieves the $P^*$-requirement for $CS(\delta^*)$.

5.3.  Other  Developments 

There are other recent developments concerning, among other things, truncated versions
of earlier open sequential procedures, improvements in Paulson's (1964) procedure,
and the two-factor model with no interaction. For a discussion of these and other developments,
see Gupta and Panchapakesan (1990).

6.  Lower  Confidence  Bounds  for  the  Probability  of  a  Correct  Selection 

Let $X_{ij}$, $j = 1, \ldots, n$, be a sample of size $n$ from a population $\pi_i$, where $\pi_1, \ldots, \pi_k$
are independently distributed with continuous distribution function $G(x - \theta_i)$, $1 \le i \le k$.
Let $\theta_{[1]} \le \ldots \le \theta_{[k]}$ denote the ordered $\theta_i$. The population associated with $\theta_{[k]}$ is
called the best population. Assume that the experimenter is interested in the selection
of the best population. For this purpose, an appropriate statistic $Y_i = Y(X_{i1}, \ldots, X_{in})$
with cumulative distribution function $F_n(y - \theta_i)$ is chosen, and the natural selection rule
that selects the population yielding the largest $Y_i$ as the best population is applied. Let
CS (correct selection) denote the event that the best population is selected. Then, the
probability of a correct selection (PCS) applying the natural selection rule is: For
$\theta = (\theta_1, \ldots, \theta_k)$,

$$P_\theta\{CS\} = \int_{-\infty}^{\infty} \prod_{i=1}^{k-1} F_n(y + \theta_{[k]} - \theta_{[i]})\, dF_n(y). \tag{6.1}$$
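As an illustration of (6.1), the sketch below evaluates the PCS numerically in the special case where each $Y_i$ is the mean of $n$ normal observations with known standard deviation, so that $F_n(y) = \Phi(\sqrt{n}\,y/\sigma)$. The function name `pcs_normal`, the midpoint-rule grid and the example parameter values are assumptions made only for this illustration.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def pcs_normal(thetas, n, sigma=1.0, grid=4000, span=8.0):
    """Evaluate (6.1) when Y_i is the mean of n N(theta_i, sigma^2) observations."""
    ordered = sorted(thetas)
    best, rest = ordered[-1], ordered[:-1]
    scale = math.sqrt(n) / sigma
    step = 2 * span / grid
    total = 0.0
    for g in range(grid):
        x = -span + (g + 0.5) * step  # midpoint rule in the standardized variable
        prod = 1.0
        for th in rest:
            prod *= PHI.cdf(x + scale * (best - th))
        total += prod * PHI.pdf(x) * step
    return total


print(round(pcs_normal([0.0, 0.0, 0.0], n=5), 4))   # all means equal: PCS is 1/k
print(round(pcs_normal([0.0, 0.2, 0.7], n=10), 4))  # hypothetical configuration
```

When all means coincide the integral reduces to $1/k$, and for $k = 2$ it reduces to $\Phi(\sqrt{n}(\theta_{[2]} - \theta_{[1]})/(\sqrt{2}\sigma))$; both serve as quick sanity checks on the quadrature.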

To guarantee the PCS, Bechhofer (1954) introduced the indifference zone approach in
which the experimenter is asked to assign a positive value $\delta^*$ such that $\theta_{[k]} - \theta_{[k-1]} \ge \delta^*$.
However, in a real situation, it may be hard to assign the value of $\delta^*$ such that
$\theta_{[k]} - \theta_{[k-1]} \ge \delta^*$, since the parameter values $\theta_1, \ldots, \theta_k$ are unknown. If the above assumption is
not satisfied, the PCS cannot be guaranteed to be at least equal to the prespecified level.
Parnes and Srinivasan (1986) have pointed out certain inconsistencies in the indifference
zone formulation of certain selection problems. Also, see Fabian (1962) and Hsu (1981)
for some possible ways out of this impasse.

Retrospective  analyses  regarding  the  PCS  have  been  studied  by  several  authors.  Olkin, 
Sobel  and  Tong  (1976,1982)  have  presented  estimators  of  the  PCS.  Faltin  and  McCulloch 
(1983) have studied the small-sample properties of the Olkin-Sobel-Tong estimators for
the case $k = 2$. Bofinger (1985) has discussed the nonexistence of consistent estimators of
the  PCS.  Gutmann  and  Maymin  (1987)  have  presented  a  procedure  to  test  whether  the 
selected  population  is  the  best.  Anderson,  Bishop  and  Dudewicz  (1977)  have  given  a  lower 
confidence  bound  on  the  PCS  in  normal  distribution  models. 

In  the  following,  we  will  review  some  recent  developments  regarding  the  construction 
of  lower  confidence  bounds  for  the  PCS. 

6.1.  A  Lower  Confidence  Bound  on  PCS  for  Distributions  with  MLR  Property 

In (6.1), replace $\theta_{[k]} - \theta_{[i]}$ by $\theta_{[k]} - \theta_{[k-1]}$ for each $i = 1, \ldots, k - 2$. Then, one can
obtain an inequality

$$P_\theta\{CS\} \ge \int_{-\infty}^{\infty} [F_n(y + \theta_{[k]} - \theta_{[k-1]})]^{k-1}\, dF_n(y). \tag{6.2}$$

Kim (1986) proposed a method to find a conservative lower confidence bound on the
PCS by first finding a lower confidence bound on $\theta_{[k]} - \theta_{[k-1]}$.


Let $H(x)$ be the distribution of $(Y_1 - \theta_1) - (Y_2 - \theta_2)$. That is,

$$H(x) = \int_{-\infty}^{\infty} F_n(x + y)\, dF_n(y). \tag{6.3}$$


The distribution $H(x)$ is independent of the parameters $\theta_1$ and $\theta_2$, and $H(x)$ is symmetric
about the point 0. For $0 < \alpha < 1$, let $t_{\alpha/2}$ be the upper $\alpha/2$-quantile of the distribution
$H(x)$. By the symmetry of $H(x)$, $t_{\alpha/2} > 0$. For this fixed $\alpha$, define a nonnegative
function $L_\alpha(t)$ on $[0, \infty)$ implicitly by

$$H(L_\alpha(t) - t) + H(-L_\alpha(t) - t) = \alpha \quad \text{for } t > t_{\alpha/2} \tag{6.4}$$

and let $L_\alpha(t) = 0$ if $0 \le t \le t_{\alpha/2}$. Let $Y_{[1]} \le \ldots \le Y_{[k]}$ denote the order statistics
of $Y_1, \ldots, Y_k$. Also, let $f_n$ be the associated pdf of the distribution function $F_n$. Finally,
define

$$P_L = \int_{-\infty}^{\infty} [F_n(y + L_\alpha(Y_{[k]} - Y_{[k-1]}))]^{k-1}\, dF_n(y). \tag{6.5}$$

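Theorem 6.1 (Kim (1986)). Assume that $\log f_n(y)$ is concave. Then,

$$P_\theta\{\theta_{[k]} - \theta_{[k-1]} \ge L_\alpha(Y_{[k]} - Y_{[k-1]})\} \ge 1 - \alpha,$$

and hence,

$$P_\theta\{P_\theta\{CS\} \ge P_L\} \ge 1 - \alpha \quad \text{for all } \theta.$$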


6.1.1.  Normal  Populations  with  a  Common  Variance 

Let $X_{ij}$, $j = 1, \ldots, n$, be a sample of size $n$ from $N(\theta_i, \sigma^2)$, $i = 1, \ldots, k$, where the
common variance $\sigma^2$ may be either known or unknown. The best population is the one
associated with $\theta_{[k]}$. Let $Y_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}$ be the sample mean for each $i = 1, \ldots, k$. The
natural selection rule selects the population yielding the largest sample mean value $Y_{[k]}$ as
the best population. The PCS is:

$$P_\theta\{CS\} = \int_{-\infty}^{\infty} \prod_{i=1}^{k-1} \Phi\!\left(x + \frac{\sqrt{n}\,(\theta_{[k]} - \theta_{[i]})}{\sigma}\right) d\Phi(x),$$

where $\Phi(\cdot)$ is the standard normal distribution.

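When the common variance $\sigma^2$ is known, for $0 < \alpha < 1$, the function $L_\alpha(t)$ is implicitly
defined such that

$$\Phi(L_\alpha(t) - t) + \Phi(-L_\alpha(t) - t) = \alpha \quad \text{for } t > z_{\alpha/2},$$

and $L_\alpha(t) = 0$ for $0 \le t \le z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$-quantile of $\Phi(\cdot)$. Kim (1986)
obtained a lower confidence bound for the PCS which is given as follows:

$$P_L = \int_{-\infty}^{\infty} \left[\Phi\!\left(x + \sqrt{2}\, L_\alpha\!\left(\frac{\sqrt{n}\,(Y_{[k]} - Y_{[k-1]})}{\sqrt{2}\,\sigma}\right)\right)\right]^{k-1} d\Phi(x),$$

and $P_\theta\{P_\theta\{CS\} \ge P_L\} \ge 1 - \alpha$ for all $\theta$.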
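Because $L_\alpha(t)$ is defined only implicitly, it has to be found numerically in practice. The sketch below solves the normal-case equation by bisection; the function name `l_alpha`, the bracketing interval and the tolerance are illustrative assumptions. Bisection applies because the left-hand side is increasing in $L_\alpha(t)$ for $t > 0$ and lies below $\alpha$ at zero exactly when $t > z_{\alpha/2}$.

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def l_alpha(t, alpha, tol=1e-10):
    """Solve Phi(L - t) + Phi(-L - t) = alpha for L >= 0; return 0 when t <= z_{alpha/2}."""
    z_half = PHI.inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile of Phi
    if t <= z_half:
        return 0.0

    def excess(value):
        return PHI.cdf(value - t) + PHI.cdf(-value - t) - alpha

    lo, hi = 0.0, t + 10.0  # excess(lo) < 0 and excess(hi) > 0 on this bracket
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for t_value in (1.0, 2.5, 3.0, 4.0):
    print(t_value, round(l_alpha(t_value, alpha=0.05), 4))
```

For large $t$ the second term is negligible and the solution approaches $t - z_\alpha$, which gives a rough check on the output.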

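When the common variance $\sigma^2$ is unknown, let $S^2 = \frac{1}{\nu}\sum_{i=1}^{k}\sum_{j=1}^{n} (X_{ij} - Y_i)^2$, where
$\nu = k(n - 1)$. Note that $\nu S^2/\sigma^2$ has a $\chi^2$-distribution with $\nu$ degrees of freedom. Let $Q_\nu$
denote the distribution of $S/\sigma$. For given $0 < \alpha < 1$, let $L_\alpha^*(t)$ be the function implicitly
defined by

$$\int_0^{\infty} [\Phi(L_\alpha^*(t) - tu) + \Phi(-L_\alpha^*(t) - tu)]\, dQ_\nu(u) = \alpha \quad \text{for } t > t_{\alpha/2}(\nu)$$

and $L_\alpha^*(t) = 0$ for $0 \le t \le t_{\alpha/2}(\nu)$, where $t_{\alpha/2}(\nu)$ is the upper $\alpha/2$-quantile of the $t$-distribution
with $\nu$ degrees of freedom. Kim (1986) obtained a lower confidence bound for
the PCS as follows:

$$P_L^* = \int_{-\infty}^{\infty} \left[\Phi\!\left(x + \sqrt{2}\, L_\alpha^*\!\left(\frac{\sqrt{n}\,(Y_{[k]} - Y_{[k-1]})}{\sqrt{2}\,S}\right)\right)\right]^{k-1} d\Phi(x),$$

and $P_\theta\{P_\theta\{CS\} \ge P_L^*\} \ge 1 - \alpha$ for all $\theta$.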

The values needed to implement the procedures have been tabulated by Kim (1986) for
$\alpha = 0.5$ and $0.1$ and for some values of $\nu$.


6.1.2.  Two-Parameter  Exponential  Populations

Let $X_{ij}$, $j = 1, \ldots, n$, be a sample of size $n$ from a two-parameter exponential distribution
with pdf $g(x \mid \theta_i, \beta) = \beta^{-1}\exp(-(x - \theta_i)/\beta)\, I_{(\theta_i, \infty)}(x)$, $i = 1, \ldots, k$, where the
common scale parameter $\beta > 0$ may be either known or unknown. The best population
is the one associated with $\theta_{[k]}$. Let $Y_i = \min(X_{i1}, \ldots, X_{in})$, $i = 1, \ldots, k$. The natural
selection rule selects the population yielding $Y_{[k]}$ as the best population. The PCS is:

$$P_\theta\{CS\} = \int_{y=0}^{\infty} \prod_{i=1}^{k-1} \left[1 - \exp(-(y + n(\theta_{[k]} - \theta_{[i]})/\beta))\right] e^{-y}\, dy.$$

Let $H(t)$ be the distribution function of $n(Y_1 - \theta_1)/\beta - n(Y_2 - \theta_2)/\beta$. Then,

$$H(t) = \begin{cases} 1 - \frac{1}{2}e^{-t} & \text{if } t \ge 0; \\ \frac{1}{2}e^{t} & \text{if } t < 0. \end{cases}$$

When the common scale parameter $\beta$ is known, for $0 < \alpha < 1$, let $t_{\alpha/2}$ denote the
upper $\alpha/2$-quantile of $H(t)$. Then, the function $L_\alpha(t)$ is implicitly defined by

$$H(L_\alpha(t) - t) + H(-L_\alpha(t) - t) = \alpha \quad \text{for } t > t_{\alpha/2}$$

and $L_\alpha(t) = 0$ for $0 \le t \le t_{\alpha/2}$.

Gupta, Leu and Liang (1990) obtained a lower confidence bound for the PCS as
follows:

$$P_L = \int_{y=0}^{\infty} \left[1 - \exp(-y - L_\alpha(n(Y_{[k]} - Y_{[k-1]})/\beta))\right]^{k-1} e^{-y}\, dy,$$

and $P_\theta\{P_\theta\{CS\} \ge P_L\} \ge 1 - \alpha$ for all $\theta$.
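In the known-scale exponential case everything can be written in closed form, which the following sketch illustrates. It assumes $0 < \alpha \le 1/2$, so that the solution satisfies $L_\alpha(t) \le t$ and both arguments of $H$ are nonpositive; the equation then reads $e^{-t}\cosh L_\alpha(t) = \alpha$. Substituting $u = e^{-y}$ in the integral gives $P_L = [1 - (1 - c)^k]/(kc)$ with $c = e^{-L}$. The function names and example inputs are assumptions introduced here for illustration and do not come from Gupta, Leu and Liang (1990).

```python
import math


def laplace_cdf(t):
    """H(t): distribution function of the difference of two independent Exp(1) variables."""
    return 1 - 0.5 * math.exp(-t) if t >= 0 else 0.5 * math.exp(t)


def l_alpha_exp(t, alpha):
    """Closed-form root of H(L - t) + H(-L - t) = alpha, assuming 0 < alpha <= 1/2."""
    t_half = -math.log(alpha)  # upper alpha/2 quantile: 0.5 * exp(-t) = alpha / 2
    if t <= t_half:
        return 0.0
    return math.acosh(alpha * math.exp(t))


def pcs_lower_bound_exp(gap, n, beta, k, alpha):
    """P_L for known beta, where gap = Y_[k] - Y_[k-1] and Y_i is the sample minimum."""
    shift = l_alpha_exp(n * gap / beta, alpha)
    c = math.exp(-shift)
    # Integral of (1 - c e^{-y})^{k-1} e^{-y} over y > 0, via the substitution u = e^{-y}.
    return (1 - (1 - c) ** k) / (k * c)


print(round(pcs_lower_bound_exp(gap=0.5, n=10, beta=1.0, k=4, alpha=0.05), 4))
```

When the observed gap is small enough that $L_\alpha = 0$, the bound collapses to $1/k$, the PCS under equal location parameters.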

When the common scale parameter $\beta$ is unknown, let $S = \sum_{i=1}^{k}\sum_{j=1}^{n} (X_{ij} - Y_i)$, where
$\nu = k(n - 1)$. Then $S/\beta$ has a $\Gamma(\nu, 1)$ distribution. Let $Q_\nu(\cdot)$ denote the distribution of
$S/\beta$. For $0 < \alpha < 1$, let $t_{\alpha/2}^*$ be the point such that $\int_0^{\infty} H(-t_{\alpha/2}^*\, y)\, dQ_\nu(y) = \alpha/2$. The
function $L_\alpha^*(t)$ is then implicitly defined by

$$\int_0^{\infty} [H(L_\alpha^*(t) - yt) + H(-L_\alpha^*(t) - yt)]\, dQ_\nu(y) = \alpha \quad \text{for } t > t_{\alpha/2}^*,$$

and $L_\alpha^*(t) = 0$ for $0 \le t \le t_{\alpha/2}^*$.

Gupta, Leu and Liang (1990) obtained the following lower confidence bound for the
PCS:

$$P_L^* = \int_0^{\infty} \left[1 - \exp(-y - L_\alpha^*(n(Y_{[k]} - Y_{[k-1]})/S))\right]^{k-1} e^{-y}\, dy,$$

and $P_\theta\{P_\theta\{CS\} \ge P_L^*\} \ge 1 - \alpha$ for all $\theta$.

The values needed to implement the procedures have been tabulated by Gupta, Leu and
Liang (1990) for $\alpha = 0.05$ and $0.1$ and for some values of $\nu$.


6.2.  Lower  Confidence  Bounds  on  the  PCS  for  General  Location-Parameter  Models


Gupta and Liang (1987) have constructed lower confidence bounds on the PCS for
general location-parameter models, where the sample size $n$ is determined according to
the indifference zone formulation of Bechhofer (1954). Note that

$$\inf_{\theta \in \Omega(\delta^*)} P_\theta\{CS\} = \int_{-\infty}^{\infty} [F_n(y + \delta^*)]^{k-1}\, dF_n(y), \tag{6.6}$$

where $\Omega(\delta^*) = \{\theta \mid \theta_{[k]} - \theta_{[k-1]} \ge \delta^*\}$ is called the preference zone. Suppose that the right-hand
side of (6.6) is an increasing function of $n$, and tends to one as $n$ tends to infinity.
For a given probability level $P^*$ ($k^{-1} < P^* < 1$), let

$$n_0 = n_0(\delta^*, P^*) = \min\left\{n : \int_{-\infty}^{\infty} [F_n(y + \delta^*)]^{k-1}\, dF_n(y) \ge P^*\right\}. \tag{6.7}$$

That is, $n_0$ is the minimum common sample size so that the PCS will be guaranteed to be
at least $P^*$ when the natural selection rule is applied and $\theta \in \Omega(\delta^*)$.


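Let $Y_{[1]} \le \ldots \le Y_{[k]}$ denote the order statistics of $Y_1, \ldots, Y_k$. For given $0 < \alpha < 1$,
let $c(k, n_0, \alpha)$ be the value such that

$$P_\theta\left\{\max_{1 \le i \le k}(Y_i - \theta_i) - \min_{1 \le j \le k}(Y_j - \theta_j) \le c(k, n_0, \alpha)\right\} = 1 - \alpha. \tag{6.8}$$

Note that the value of $c(k, n_0, \alpha)$ is independent of $\theta$. Define

$$h_i = (Y_{[k]} - Y_{[i]} - c(k, n_0, \alpha))^+, \tag{6.9}$$
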

where $y^+ = \max(0, y)$, and

$$P_L = \int_{-\infty}^{\infty} \prod_{i=1}^{k-1} F_{n_0}(y + h_i)\, dF_{n_0}(y). \tag{6.10}$$

Gupta and Liang (1987) proposed $P_L$ as an estimator of a lower bound of the PCS,
and obtained the following result.

Theorem 6.2 (Gupta and Liang (1987)).

$$P_\theta\{\theta_{[k]} - \theta_{[i]} \ge h_i \text{ for all } i = 1, \ldots, k - 1\} \ge 1 - \alpha \quad \text{for all } \theta,$$

and therefore,

$$P_\theta\{P_\theta\{CS\} \ge P_L\} \ge 1 - \alpha \quad \text{for all } \theta.$$


6.2.1.  Normal  Population  with  a  Common  Variance 

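Consider $k$ normal populations $N(\theta_i, \sigma^2)$, $i = 1, \ldots, k$, with unknown means $\theta_1, \ldots, \theta_k$
and common variance $\sigma^2$, where $\sigma^2$ may be either known or unknown.

When the common variance $\sigma^2$ is known, let $Y_i = \frac{1}{n_0}\sum_{j=1}^{n_0} X_{ij}$, where $X_{i1}, \ldots, X_{in_0}$ is
a sample of size $n_0$ from $N(\theta_i, \sigma^2)$ and $n_0$ is determined, for the indifference zone formulation,
by

$$n_0 = \min\left\{n : \int_{-\infty}^{\infty} \left[\Phi\!\left(x + \frac{\sqrt{n}\,\delta^*}{\sigma}\right)\right]^{k-1} d\Phi(x) \ge P^*\right\}$$

where both $\delta^*$ ($> 0$) and $P^*$ ($k^{-1} < P^* < 1$) are prespecified by the experimenter. The
PCS applying the natural selection rule is

$$P_\theta\{CS\} = \int_{-\infty}^{\infty} \prod_{i=1}^{k-1} \Phi\!\left(x + \frac{\sqrt{n_0}\,(\theta_{[k]} - \theta_{[i]})}{\sigma}\right) d\Phi(x).$$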
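The minimum sample size $n_0$ for the known-variance normal case can be found by a direct search over $n$. The sketch below assumes the least favorable configuration integral $\int [\Phi(x + \sqrt{n}\,\delta^*/\sigma)]^{k-1} d\Phi(x)$ and evaluates it with a midpoint rule; the function names, the grid and the search cap `n_max` are illustrative assumptions.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def lfc_pcs(n, k, delta_star, sigma, grid=2000, span=8.0):
    """PCS under the least favorable configuration theta_[k] - theta_[i] = delta_star."""
    shift = math.sqrt(n) * delta_star / sigma
    step = 2 * span / grid
    total = 0.0
    for g in range(grid):
        x = -span + (g + 0.5) * step  # midpoint rule
        total += PHI.cdf(x + shift) ** (k - 1) * PHI.pdf(x) * step
    return total


def min_sample_size(k, delta_star, sigma, p_star, n_max=10000):
    """Smallest n whose least favorable PCS is at least p_star."""
    for n in range(1, n_max + 1):
        if lfc_pcs(n, k, delta_star, sigma) >= p_star:
            return n
    raise ValueError("no n <= n_max meets the requirement")


print(min_sample_size(k=2, delta_star=0.5, sigma=1.0, p_star=0.95))
print(min_sample_size(k=4, delta_star=0.5, sigma=1.0, p_star=0.90))
```

For $k = 2$ the integral equals $\Phi(\sqrt{n}\,\delta^*/(\sqrt{2}\sigma))$, so the search result can be compared with the smallest integer not less than $2(z_{1-P^*}\,\sigma/\delta^*)^2$, where $z_{1-P^*}$ is the upper $(1 - P^*)$-quantile of $\Phi$.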


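For given $0 < \alpha < 1$, choose the value $c(k, n_0, \alpha)$ such that

$$P_\theta\left\{\max_{1 \le i \le k}(Y_i - \theta_i) - \min_{1 \le j \le k}(Y_j - \theta_j) \le c(k, n_0, \alpha)\right\} = 1 - \alpha.$$

Note that here, $c(k, n_0, \alpha) = \sigma q_{k,\infty}^{\alpha}/\sqrt{n_0}$, where $q_{k,\infty}^{\alpha}$ is the $100(1 - \alpha)$th percentile of
Tukey's studentized range statistic with parameters $(k, \infty)$. The value of $q_{k,\infty}^{\alpha}$ is available
from Harter (1965). Define

$$h_i = (Y_{[k]} - Y_{[i]} - c(k, n_0, \alpha))^+$$

and

$$P_L = \int_{-\infty}^{\infty} \prod_{i=1}^{k-1} \Phi\!\left(x + \frac{\sqrt{n_0}\,h_i}{\sigma}\right) d\Phi(x).$$

Then, by Theorem 6.2, $P_\theta\{P_\theta\{CS\} \ge P_L\} \ge 1 - \alpha$ for all $\theta$.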
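Once $c(k, n_0, \alpha)$ is available, the lower confidence bound is a one-dimensional integral. The sketch below assumes the form $P_L = \int \prod_{i} \Phi(x + \sqrt{n_0}\,h_i/\sigma)\, d\Phi(x)$ with $h_i = (Y_{[k]} - Y_{[i]} - c)^+$, and takes $c$ as an input; the studentized range percentile `q_value` used in the example is a placeholder that would have to be read from tables such as those cited above, and the sample means are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def pcs_lower_bound(means, c, n0, sigma, grid=2000, span=8.0):
    """Lower confidence bound P_L on the PCS from k sample means, known sigma."""
    ordered = sorted(means)
    top = ordered[-1]
    # h_i = (Y_[k] - Y_[i] - c)^+ for the k - 1 non-leading populations
    shifts = [math.sqrt(n0) * max(0.0, top - y - c) / sigma for y in ordered[:-1]]
    step = 2 * span / grid
    total = 0.0
    for g in range(grid):
        x = -span + (g + 0.5) * step  # midpoint rule
        prod = 1.0
        for s in shifts:
            prod *= PHI.cdf(x + s)
        total += prod * PHI.pdf(x) * step
    return total


n0, sigma = 30, 1.0
q_value = 3.31                       # placeholder for the tabulated percentile (k = 3)
c = sigma * q_value / math.sqrt(n0)  # c(k, n0, alpha) in the known-variance normal case
print(round(pcs_lower_bound([0.10, 0.35, 1.40], c, n0, sigma), 4))
```

If $c$ exceeds every observed gap, all $h_i$ vanish and the bound falls back to $1/k$, which reflects that the data then carry no evidence separating the leading population from the rest.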

When the common variance $\sigma^2$ is unknown, Bechhofer, Dunnett and Sobel (1954)
presented a two-stage selection rule given as follows.

Take a sample of $n_0$ ($n_0 \ge 2$) observations from each of the $k$ populations.
Compute $Y_i(n_0) = \frac{1}{n_0}\sum_{j=1}^{n_0} X_{ij}$, $i = 1, \ldots, k$, and $S^2 = \frac{1}{\nu}\sum_{i=1}^{k}\sum_{j=1}^{n_0} (X_{ij} - Y_i(n_0))^2$, where
$\nu = k(n_0 - 1)$. Define $N = \max\{n_0, [S^2h^2/\delta^{*2}]\}$, where $[y]$ is the smallest integer not less
than $y$, and $h$ is a positive value such that

$$\int_0^{\infty}\int_{-\infty}^{\infty} [\Phi(z + wh)]^{k-1}\, d\Phi(z)\, dF_W(w) = P^*$$

where $F_W(\cdot)$ is the distribution function of the nonnegative random variable $W$ with $\nu W^2$
following a $\chi^2(\nu)$-distribution.


Then, take $N - n_0$ additional observations from each of the $k$ populations. Compute the overall
sample mean $Y_i(N) = \frac{1}{N}\sum_{j=1}^{N} X_{ij}$, $i = 1, \ldots, k$. The natural selection rule selects the
population yielding the largest sample mean value $Y_{[k]}(N)$ as the best population.
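The first-stage computations of the two-stage rule can be sketched as follows. The code assumes that the total sample size has the form $N = \max\{n_0, \lceil S^2h^2/\delta^{*2} \rceil\}$, with $S^2$ the pooled variance on $\nu = k(n_0 - 1)$ degrees of freedom, and it takes the constant $h$ as an input rather than solving the double integral that defines it; the function name and the toy data are illustrative assumptions.

```python
import math


def total_sample_size(first_stage, h, delta_star):
    """Return (N, S^2) from first-stage samples, one list of n0 observations per population."""
    k = len(first_stage)
    n0 = len(first_stage[0])
    nu = k * (n0 - 1)
    pooled = 0.0
    for sample in first_stage:
        mean = sum(sample) / n0
        pooled += sum((x - mean) ** 2 for x in sample)
    s2 = pooled / nu
    n_total = max(n0, math.ceil(s2 * h * h / delta_star ** 2))
    return n_total, s2


# Toy first-stage data for k = 2 populations with n0 = 3 observations each.
stage_one = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
print(total_sample_size(stage_one, h=3.0, delta_star=1.0))
print(total_sample_size(stage_one, h=1.0, delta_star=1.0))
```

The second call shows the case in which the first-stage sample is already large enough, so that no additional observations are required.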

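For this two-stage selection rule,

$$P_\theta\{CS\} \ge \int_0^{\infty}\int_{-\infty}^{\infty} \prod_{i=1}^{k-1} \Phi\!\left(x + \frac{wh(\theta_{[k]} - \theta_{[i]})}{\delta^*}\right) d\Phi(x)\, dF_W(w).$$

Let $c = Sq_{k,\nu}^{\alpha}/\sqrt{N}$, where $q_{k,\nu}^{\alpha}$ is the $100(1 - \alpha)$th percentile of Tukey's studentized
range statistic with parameters $(k, \nu)$. Define $\delta_{Li} = (Y_{[k]}(N) - Y_{[i]}(N) - c)^+$. Let

$$Q_L = \int_0^{\infty}\int_{-\infty}^{\infty} \prod_{i=1}^{k-1} \Phi\!\left(x + \frac{wh\,\delta_{Li}}{\delta^*}\right) d\Phi(x)\, dF_W(w).$$

Gupta and Liang (1987) obtained the following lower confidence bound on the PCS:

$$P_\theta\{\theta_{[k]} - \theta_{[i]} \ge \delta_{Li} \text{ for all } i = 1, \ldots, k - 1\} \ge 1 - \alpha \quad \text{for all } \theta,$$

and therefore,

$$P_\theta\{P_\theta\{CS\} \ge Q_L\} \ge 1 - \alpha \quad \text{for all } \theta.$$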